Tennis Through The Decades¶

- By Team SmithServe¶

Introduction¶

SmithServe is going to perform an extensive examination of ATP sports data from 2012 to 2022. The main goal is to give entertainment firms and brands looking for sponsorship information into professional tennis players' performances. The analysis will provide data-driven reports and visualizations that will provide practical insights which will enable wise decision-making in the fast-paced world of professional tennis.

Our Clients¶

Sports Entertainment Data¶

SmithServe's ATP sports data analysis serves the needs of prominent sports entertainment broadcasters—ESPN in the USA and BeIN Sports in Europe.

ESPN Utilization:¶

Telecasting Enhancement and Anchor Discussions :

ESPN can leverage the analysis for enhanced telecasting by providing detailed insights into close calls and player statistics before each game. The data will provide ESPN anchors with comprehensive player information for in-depth discussions during their coverage.

BeIN Sports Utilization:¶

Player Stats and Betting Insights:

BeIN Sports, is an European sports channel that targets consumers who prefer information about a players statistics so that they can use the analysis for betting. Player statistics can be displayed during or before each game. Betting is very popular in Europe, the data SmithServe provides will help give insights that viewers can benefit from which will help improve their experience with BeIN Sports compared to other sport broadcasters in Europe.

Betting Data¶

Smith Serves ATP sports data analysis serves the needs of DraftKings Sportsbook, a leading betting company.

DraftKings Sportsbook Utilization:¶

Informed Betting Decisions and Revenue Optimization:

The analysis shows information on player profiles and performance in various games, providing DraftKings Sportsbook clients a player's probability of winning a game. This helps betters to more informative precise decisions. The analysis will help DraftKings Sportsbook's revenue when it helps clients make better bets which will then make sure that they keep on returning and place more bets through DraftKings Sportsbook because it will be more popular.

Sponsorship Data¶

Brand Association:¶

Nike, Under Armour, Lactose, and Sergio Tacchini will want to make sure that they associate their brands with with the correct player. Picking a player who will expand and upvote their brand is the goal therefore they will want to make sure to pick a player who has excellence, precision, and great performance.

Data Source¶

SmithServe was able to gather data through Kaggle about ATP (Association of Tennis Professionals) sports that spans from 1968 through 2023 (https://www.kaggle.com/datasets/warcoder/atp-tennis-rankings-results-and-stats1968-2023). However, SmithServe focused on data that range from 2001 - 2023 in order to produce more relevant and current information.

Description¶

The data specifically contains detailed information about every minute detail related to tennis and that explains the 49 columns and almost 7000 observations over the years spanning from 1968 to 2023. The columns present in the dataset are:

Columns Description
tourney_id Tennis tournament ID
tourney_name Name of the tournament
surface Type of court ground on which the tournament was played
Draw_size Number of players playing in the tournament
tourney_level Level of the tournament being played: whether it is a grand-slam or an Masters tournament or an ATP 500 and so on
tourney_date Date the tournament was played on
match_num Match number within the tournament
winner_id Winner seed data
Winner_seed Seeding of the winning player
Winner_entry Special tournament entry condition for the winner
Winner_name Name of the winning player
Winner_hand The hand of the winning player (R/L/U)
Winner_ht Height of the winning player
Winner_ioc Country of origin of the winning player
Winner_age Indicates the age of the winning player
Loser_id Loser seed data
Loser_seed Seeding of the losing player
Loser_entry Special tournament entry condition for the loser
Loser_name Name of the losing player
Loser_hand The hand of the losing player (R/L/U)
Loser_ht Height of the losing player
Loser_ioc Country of origin of the losing player
Loser_age Indicates the age of the losing player
Score Final score of the game
Best_of Number of sets played in a game
Round Round that was played
minutes Number of minutes a match was played for
W_ace Number of aces won in a single match
w_df Winning double faults while serving for the winner of the game
w_svpt Number of service points won by the winning player
w_1stIn Number of serves that have landed in for the winner
w_1stWon How many points the winner won when their first serve landed in
w_2ndWon How many points the winner won when their second serve landed in
w_SvGms Total number of service games played by the winner
w_bpSaved Number of times the winner was able to win the point when their opponent had a chance to break their serve
w_bpFaced Total number of times the winner faced a breakpoint
l_ace How many aces the loser hit
l_df Winning double faults while serving for the losing player of the game
l_svpt Number of service points won by the losing player
l_1stIn Number of serves that have landed in for the losing player
l_1stWon How many points the losing player won when their first serve landed in
l_2ndWon How many points the losing player won when their second serve landed in
l_SvGms Total number of service games played by the losing player
l_bpSaved Total number of times the losing player saved a breakpoint
l_bpFaced Total number of times the losing player faced a breakpoint
Winner_rank The rank of the winning player amongst the active players
Winner_rank_points Total number of ranking points that the winning player has accumulated at the time of a particular match
Loser_rank Rank of the losing player amongst the active players
Loser_rank_points Total number of ranking points that the losing player has accumulated at the time of a particular match

However, filtering and cleaning of this original data is required for better understanding and further analysis. Since it is a huge dataset, many columns are not relevant to our analysis and hence might be required to be dropped to make our analysis more specific. Also, the column names are very specific to Tennis terminology which needs to be changed in human readable terms for making better decisions for the analysis. All these tasks are performed during the data cleaning process.

Hence, after dropping the above mentioned columns, the remaining columns were decided to be renamed as it would be much easier to understand the terms better and then conduct the analysis.

Main Goals¶

Do the top players perform better on any specific type of playing surface, and what effect does the surface have on the total score? For games played between right-handed and left-handed players, what is the percentage of victories for each handedness on different court types?¶

  • Sports brands like Nike and Under Armour rely heavily on data-driven insights to optimize their products and performance. This question explores how insights gathered from analyzing playing surfaces can guide the development of innovative and effective sports equipment, specifically focusing on tennis players. Playing surfaces can significantly impact players' performance in various ways. Understanding these impacts is crucial for sports brands to develop equipment that delivers optimal performance across different surfaces.
  • By understanding the nuances of each surface, brands can create equipment that maximizes performance. For example, developing shoes with improved traction for clay courts or lighter rackets for faster surfaces. By leveraging data-driven insights from playing surfaces, sports brands can develop highly effective and personalized equipment that empowers athletes to excel across all playing conditions. This approach not only boosts player performance but also strengthens brand reputation and drives competitive advantage.

How significantly does a player's age impact their game performance, and can we associate the success of younger players and the difficulties faced by older players with factors like stamina and match duration?¶

  • This question explores how data-driven insights derived from player performance can benefit three key stakeholders:
    • Telecasting Networks: Analyze factors like age, stamina, and match duration to determine their impact on viewership and optimize scheduling decisions.
    • DraftKings (Betting Company): Utilize insights to make informed decisions about player performance and assess the viability of betting on certain players, particularly older athletes.
    • Sports Fans: Gain a deeper understanding of the factors influencing player performance and match complexity, leading to a more engaging and informed viewing experience.
  • Imagine analyzing data across various age groups and revealing that younger players tend to perform better in shorter matches, while older players excel in longer, more strategic matches. These insights can be used by our clients in the following ways:
    • Telecasting Networks to schedule shorter matches featuring younger players during peak viewing hours to attract a larger audience.
    • DraftKings where they can increase betting odds on older players in longer matches based on their predicted performance.
    • Sports Fans can gain a deeper understanding of why certain players perform better under specific circumstances, making their viewing experience more insightful.

During the clash of the Legends i.e Djokovic/ Nadal or Djokovic/Federer or Federer/Nadal, which player proved to be a winner majority of the time against their opponent? Based on the overall success ratio, success ratio geographically and the type of surface, generate insights which player can be most reliable for brand sponsorships?¶

  • Sponsorship companies such as Lactose and Sergio Tacchini will want to sponsor a player that has meaning to the brand. Knowing which tournament a player wins the most will allow the brand who is from that area to decide which player they should associate with.

  • Sports entertainment companies like our clients ESPN and BIeN Sports will need the most reliable information about the top players when they are reporting and debating tennis matches. The anchors will be able to tell a story on how a certain player wins more or if the players are a tie most of the time. This will differentiate the broadcasters from different sports channels. The differentiation will help bring in viewers because the anchors are providing more information on the players and relating back to historical data that the analysis will provide.

Does the number of aces (ace means when a serve is in but not hit at all by the receiving player) served by a player in a game directly depend on the outcome? (more aces = winner)? Why is that so?¶

  • Betting companies like our client DraftKings Sportsbook can provide historical data of a player's game in order to help betters make precise decisions for the future. Knowing the pattern history of how often a winner or a loser has an ace will allow better to know who has a higher probability of winning the game. By providing this information it will make the betting process more efficient and will attract more betters to DraftKings Sportsbook.

Which player has a higher probability of winning the game based on features like total number of aces, number of breakpoints saved by the winner, number of breakpoints saved by the loser and the type of playing surface?¶

  • Using predictive machine learning models like RandomForestClassifier, we will try to simulate a match between 2 players and predict the winner. We will train the model on features like - Total number of Aces, Total Breakpoints saved by Winner and Loser and The Type of Surface. After executing the model, we find a 99% accuracy and a prediction of the winner.

Grading - Data Analysis¶

Our work on data analysis answers important questions regarding players’ performance on all ground types and players’ stats analysis to predict winners in future matches. We’ve implemented a machine learning model to go above and beyond to predict a winner between two user-entered player names on a particular ground type. This not only allowed us to answer the analysis questions posed in the project but also to make predictions about future data. Additionally, we’ve also worked on building a comprehensive documentation, including clear steps for data analysis with comments, markdowns and discussion on inferences based on our findings. Instead of simply presenting the results in a table or chart, we’ve interpreted the results in the context of the research question, discussed their implications, and suggested areas for future research. We believe our analysis adds value to the project and demonstrates a high level of competence in data analysis.

Data Integration of ATP Stats from 2001 - 2023¶

We have multiple datasets corresponding to the different years of the ATP (Association of Tennis Professionals) stats, ranging from the years 2001 - 2023 (about 2 decades). Our objective is to bring this data together in a simple and unified view by integrating and consequently cleaning it.

Benefits of Data Integration¶

By integrating these datasets, we achieve several benefits:

  • Comparative Studies: Compare ATP data across 2001 - 2023 to analyze the progression of different players, compare techniques and various other factors.
  • Data Consistency: This process can ensure that all the data is consistent for any analysis or visualization.
  • Accessibility: We end up creating a single csv file that contains all the data and is easier to access.
  • Makes Cleaning Process Easier: Cleaning every individual dataset would be taxing, using OS python library can automate this process.

Process Outline¶

The integration process includes the following steps:

  1. Data Collection: Compile all the individual yearly datasets into a common folder or directory.
  2. Data Merging: We use OS python library , generally used for miscelleanous operating system functions along with some other functions (.endswith(suffix)). (documentation for OS: https://docs.python.org/3/library/os.html). From this library, we use these functions:
  • os.listdir(directory): This function is used to list all the files inside any specific directory.
  • os.path.join(directory, filename): this function concatenates directory with filename.

  • .endswith(suffix): This is a string manipulation method used in python, this checks if a string ends witth a specific suffix. We use this to to check for .csv files in a given directory.

Final Cleaning Output¶

The final output will be a comprehensive ATP stats dataset that covers matches, players' statistics, rankings, and other relevant information from 2001 to 2023. This dataset will serve as a common dataframe for future analysis by researchers, sports analysts, commentators, and tennis enthusiasts.

Importing necessary python libraries¶

Pandas:¶

(documentation for pandas: https://pandas.pydata.org/docs/user_guide/index.html) This is a high-level data analysis and manipulation library used in python. his library provides a wide array of functions that compliment numpy library.

Numpy:¶

(documentation for numpy: https://numpy.org/doc/1.26/user/index.html#user) Numpy is a fundamental package used for scientific computing in python. There are many functions inside this library that can help.

In [1]:
# Import statements
import os
import numpy as np
import pandas as pd
from numpy import nan as NA
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objs as go
import dash
from dash import html, dcc, Input, Output
In [2]:
# Loading the dataset directory:
atp_directory = r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\tennis_atp_data'
combined_atp_filename = r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv'
In [3]:
# Merging the datasets into one:
dataframes = []

# Iterating each CSV file present in the ATP directory
for filename in os.listdir(atp_directory):
    if filename.endswith('.csv'):
        # Construct the full file path
        file_path = os.path.join(atp_directory, filename)
        
        # Extract the year from the filename (assuming it's in the filename)
        year = int(filename.split('.')[0][-4:])
        
        # Read the CSV file
        df = pd.read_csv(file_path)
        
        # Append a year column
        df['year'] = year
        
        # Append the DataFrame to the list
        dataframes.append(df)

# Concatenate all the DataFrames in the list
combined_df = pd.concat(dataframes)

# Sort the DataFrame based on the year column
combined_df.sort_values('year', inplace = True)

# If you want to set the year as an index
combined_df.set_index('year', inplace = True)

# Save the combined DataFrame to a new CSV file
combined_df.to_csv(combined_atp_filename, index = True)

print(f"All CSV files have been combined into {combined_atp_filename}")
All CSV files have been combined into D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv
In [4]:
# Loading the entire tennis ATP data:
tennis_data = pd.read_csv(r'D:\University of Maryland\ALL SEMESTERS\FALL 2023\TERM A\BUDT 704 - Data Processing and Analysis using Python\Group Project\combined_atp_data.csv')
tennis_data
Out[4]:
year tourney_id tourney_name surface draw_size tourney_level tourney_date match_num winner_id winner_seed ... l_1stIn l_1stWon l_2ndWon l_SvGms l_bpSaved l_bpFaced winner_rank winner_rank_points loser_rank loser_rank_points
0 2001 2001-301 Auckland Hard 32 A 20010108 1 102905 NaN ... 31.0 19.0 15.0 10.0 4.0 8.0 52.0 755.0 14.0 1598.0
1 2001 2001-520 Roland Garros Clay 128 G 20010528 70 103990 NaN ... 29.0 17.0 14.0 12.0 9.0 16.0 83.0 506.0 205.0 174.0
2 2001 2001-520 Roland Garros Clay 128 G 20010528 71 103694 NaN ... 61.0 44.0 20.0 19.0 12.0 20.0 70.0 609.0 21.0 1295.0
3 2001 2001-520 Roland Garros Clay 128 G 20010528 72 102338 7.0 ... 87.0 54.0 23.0 21.0 12.0 20.0 7.0 2605.0 112.0 371.0
4 2001 2001-520 Roland Garros Clay 128 G 20010528 73 103507 4.0 ... 38.0 21.0 8.0 11.0 17.0 24.0 4.0 2830.0 220.0 161.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67703 2023 2023-0403 Miami Masters Hard 128 M 20230320 275 111460 NaN ... 49.0 35.0 15.0 11.0 7.0 9.0 79.0 705.0 55.0 865.0
67704 2023 2023-0403 Miami Masters Hard 128 M 20230320 274 106421 4.0 ... NaN NaN NaN NaN NaN NaN 5.0 4330.0 56.0 865.0
67705 2023 2023-0403 Miami Masters Hard 128 M 20230320 273 202103 25.0 ... 33.0 26.0 8.0 10.0 2.0 6.0 31.0 1320.0 6.0 3415.0
67706 2023 2023-0403 Miami Masters Hard 128 M 20230320 279 206173 10.0 ... 32.0 23.0 7.0 10.0 6.0 11.0 11.0 2925.0 27.0 1450.0
67707 2023 2023-M-DC-2023-WG2-PO-TUN-CYP-01 Davis Cup WG2 PO: TUN vs CYP Hard 4 D 20230203 5 121411 NaN ... NaN NaN NaN NaN NaN NaN 364.0 131.0 990.0 11.0

67708 rows × 50 columns

Data Cleaning of ATP Stats from 2001 - 2023:¶

Data cleaning is a crucial step in the data analysis process. It involves preparing any raw dataset for future analysis by handling the missing values, duplicates, and making the data make sense. Below, we explain common data cleaning operations using Python's pandas and numpy libraries.

  • Handling NaN values: There are multiple ways of handling NaN values, either we can drop them using the .dropna() function or fill them up with a given value by using .fillna(value). Let us do a descriptive analysis and figure out which of these functions would benefit the analysis most.
  • Dropping unwanted columns: Sometimes datasets contain extra information and makes the data look overwhelming. To delete/drop columns from a DataFrame in pandas, you can use the .drop() method. You can specify one or more column names to be removed.
  • Renaming columns for better understanding: Column names should be precise and easy to understand. The names should follow a similar nanming convention. To rename any dataset, we use .rename() function.
  • Looking for Duplicates: Duplicate data in any analysis can skew it and alter the analysis. Identifying duplicates ca be done using .duplicated() function.
In [5]:
# Check for NaN percentage for every variable:
nan_percentage = tennis_data.isna().mean() * 100
print(nan_percentage)
year                   0.000000
tourney_id             0.000000
tourney_name           0.000000
surface                0.000000
draw_size              0.000000
tourney_level          0.000000
tourney_date           0.000000
match_num              0.000000
winner_id              0.000000
winner_seed           58.210256
winner_entry          87.404738
winner_name            0.000000
winner_hand            0.013292
winner_ht              2.241980
winner_ioc             0.000000
winner_age             0.005908
loser_id               0.000000
loser_seed            76.924440
loser_entry           79.476576
loser_name             0.000000
loser_hand             0.062031
loser_ht               4.538607
loser_ioc              0.000000
loser_age              0.008862
score                  0.000000
best_of                0.000000
round                  0.000000
minutes               11.076978
w_ace                  8.811366
w_df                   8.811366
w_svpt                 8.811366
w_1stIn                8.811366
w_1stWon               8.811366
w_2ndWon               8.811366
w_SvGms                8.809890
w_bpSaved              8.811366
w_bpFaced              8.811366
l_ace                  8.811366
l_df                   8.811366
l_svpt                 8.811366
l_1stIn                8.811366
l_1stWon               8.811366
l_2ndWon               8.811366
l_SvGms                8.809890
l_bpSaved              8.811366
l_bpFaced              8.811366
winner_rank            0.734034
winner_rank_points     0.734034
loser_rank             1.949548
loser_rank_points      1.949548
dtype: float64

Dropping Unwanted Columns¶

When we analyzed the data, we saw several columns that we were interested in dropping. The reasons for this vary. For some of the data, most of the values were invalid, either N/A, null, or NaN, and thus the entire column was worthless for analysis. For others, the column was not relevant to our analysis, and as such also had no value to include. They were removed simply for the benefit of having a leaner data set to work with.

  • winner_seed, loser_seed: These columns had approximately 58% and 77% null values respectively, so it was not possible to impute data, nor would filtering out the null values provide any value.
  • winner_entry, loser_entry: These columns had approximately 87% and 79% null values respectively, and as above, there was no meaningful way to include these variables.
  • w_df, l_df: These columns are related to the double faults in a game, when the server hits the net and forfeits the point. These are also not relevant to our analysis and will be removed.
  • tourney_date: The Tournament date column just indicates the date that a tournament took place. We won't be doing analysis that uses this data so this will be removed as well.
In [6]:
# Dropping unwanted columns
tennis_data.drop(['winner_seed', 'loser_seed', 'winner_rank_points', 'loser_rank_points',
                  'winner_entry','loser_entry', 'w_df', 'l_df', 'tourney_date', 'score',
                  'draw_size', 'match_num', 'round', 'best_of', 'w_1stWon', 'w_2ndWon',
                  'l_1stWon', 'l_2ndWon'], axis = 1, inplace = True)

Renaming columns¶

Here, we clean up the column names to make them more human-readable.

In [7]:
# Changing column names
tennis_data.rename(columns={
    'tourney_id': 'Tournament ID',
    'tourney_name': 'Tournament Name',
    'surface': 'Type of Surface',
    'tourney_level': 'Level of Tournament',
    'winner_id': 'Winner ID',
    'winner_name': 'Name of Winner',
    'winner_hand': 'Winning Player\'s Hand',
    'winner_ht': 'Winning Player\'s Height (in cm)',
    'winner_ioc': 'Winning Player\'s Country',
    'winner_age': 'Winner\'s Age',
    'loser_id': 'Loser ID',
    'loser_name': 'Name of Losing Player',
    'loser_hand': 'Losing Player\'s Hand',
    'loser_ht': 'Losing Player\'s Height (in cm)',
    'loser_ioc': 'Losing Player\'s Country',
    'loser_age': 'Loser\'s Age',
    'minutes': 'Game Duration (minutes)',
    'w_ace': 'Winner Aces',
    'w_svpt': 'Winner Service Points',
    'w_1stIn': 'Winner First Serves In',
    'w_SvGms': 'Winner Service Games Played',
    'l_ace': 'Loser Aces',
    'l_svpt': 'Loser Service Points',
    'l_1stIn': 'Loser First Serves In',
    'l_SvGms': 'Loser Service Games Played',
    'winner_rank': 'Winning Player\'s Rank',
    'loser_rank': 'Losing Player\'s Rank',
    'year': 'Year',
    'w_bpSaved': 'Winner Break Points Saved',
    'w_bpFaced': 'Winner Break Points Faced',
    'l_bpFaced': 'Loser Break Points Saved',
    'l_bpSaved': 'Loser Break Points Faced'
}, inplace=True)
tennis_data
Out[7]:
Year Tournament ID Tournament Name Type of Surface Level of Tournament Winner ID Name of Winner Winning Player's Hand Winning Player's Height (in cm) Winning Player's Country ... Winner Break Points Saved Winner Break Points Faced Loser Aces Loser Service Points Loser First Serves In Loser Service Games Played Loser Break Points Faced Loser Break Points Saved Winning Player's Rank Losing Player's Rank
0 2001 2001-301 Auckland Hard A 102905 Stefan Koubek L 175.0 AUT ... 14.0 16.0 5.0 66.0 31.0 10.0 4.0 8.0 52.0 14.0
1 2001 2001-520 Roland Garros Clay G 103990 Tommy Robredo R 180.0 ESP ... 0.0 1.0 0.0 70.0 29.0 12.0 9.0 16.0 83.0 205.0
2 2001 2001-520 Roland Garros Clay G 103694 Olivier Rochus R 168.0 BEL ... 9.0 16.0 14.0 118.0 61.0 19.0 12.0 20.0 70.0 21.0
3 2001 2001-520 Roland Garros Clay G 102338 Yevgeny Kafelnikov R 190.0 RUS ... 13.0 21.0 6.0 150.0 87.0 21.0 12.0 20.0 7.0 112.0
4 2001 2001-520 Roland Garros Clay G 103507 Juan Carlos Ferrero R 183.0 ESP ... 1.0 1.0 1.0 71.0 38.0 11.0 17.0 24.0 4.0 220.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67703 2023 2023-0403 Miami Masters Hard M 111460 Quentin Halys R 191.0 FRA ... 6.0 6.0 9.0 82.0 49.0 11.0 7.0 9.0 79.0 55.0
67704 2023 2023-0403 Miami Masters Hard M 106421 Daniil Medvedev R 198.0 RUS ... NaN NaN NaN NaN NaN NaN NaN NaN 5.0 56.0
67705 2023 2023-0403 Miami Masters Hard M 202103 Francisco Cerundolo R 185.0 ARG ... 1.0 2.0 3.0 61.0 33.0 10.0 2.0 6.0 31.0 6.0
67706 2023 2023-0403 Miami Masters Hard M 206173 Jannik Sinner R 188.0 ITA ... 1.0 3.0 1.0 58.0 32.0 10.0 6.0 11.0 11.0 27.0
67707 2023 2023-M-DC-2023-WG2-PO-TUN-CYP-01 Davis Cup WG2 PO: TUN vs CYP Hard D 121411 Moez Echargui R 180.0 TUN ... NaN NaN NaN NaN NaN NaN NaN NaN 364.0 990.0

67708 rows × 32 columns

Duplicate check¶

After dropping some columns and renaming the columns, we want to make sure there are no duplicate observations present to skew the analysis.

In [8]:
# Check for duplicates:
duplicates = tennis_data.duplicated()
number_of_duplicates = duplicates.sum()
print(f'Number of duplicate rows: {number_of_duplicates}')
Number of duplicate rows: 0

Filtering Vs Imputing¶

  • For the Data Analysis to be more precise, it is important to clean the dataset using the most optimum method because it can affect the analysis power.
  • This can be done through either Filtering .dropna() or Imputing .fillna() method. Usually filtering a small proportion of data is okay but since this dataset consists more than 60,000 rows, filtering may affect the quality of the data.
  • On the other end, imputing within limits can help preserve the sample without affecting the analysis significantly.
In [9]:
#provide a summary of the descriptive statistics of the entire tennis data
tennis_data.describe()
Out[9]:
Year Winner ID Winning Player's Height (in cm) Winner's Age Loser ID Losing Player's Height (in cm) Loser's Age Game Duration (minutes) Winner Aces Winner Service Points ... Winner Break Points Saved Winner Break Points Faced Loser Aces Loser Service Points Loser First Serves In Loser Service Games Played Loser Break Points Faced Loser Break Points Saved Winning Player's Rank Losing Player's Rank
count 67708.000000 67708.000000 66190.000000 67704.000000 67708.000000 64635.000000 67702.000000 60208.000000 61742.000000 61742.000000 ... 61742.000000 61742.000000 61742.000000 61742.000000 61742.000000 61743.000000 61742.000000 61742.000000 67211.000000 66388.000000
mean 2011.380250 108946.000177 186.200030 26.346919 109010.336873 185.667796 26.449805 106.949575 6.923504 77.945953 ... 3.451378 5.018010 5.113116 81.016342 48.665382 12.308294 4.766788 8.604516 79.346759 117.933256
std 6.541564 18394.042835 6.827807 3.971546 18438.182738 6.781806 4.088679 41.292377 5.553292 29.217210 ... 3.070134 4.024942 4.899089 29.215464 19.260585 4.232881 3.265035 4.138488 138.557775 186.546855
min 2001.000000 100644.000000 163.000000 14.900000 100644.000000 163.000000 14.500000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 2006.000000 103656.000000 183.000000 23.400000 103580.000000 183.000000 23.500000 77.000000 3.000000 56.000000 ... 1.000000 2.000000 2.000000 60.000000 35.000000 9.000000 2.000000 6.000000 18.000000 36.000000
50% 2011.000000 104468.000000 185.000000 26.100000 104468.000000 185.000000 26.300000 99.000000 6.000000 73.000000 ... 3.000000 4.000000 4.000000 76.000000 45.000000 11.000000 4.000000 8.000000 45.000000 68.000000
75% 2017.000000 105332.000000 190.000000 29.000000 105453.000000 190.000000 29.300000 130.000000 9.000000 94.000000 ... 5.000000 7.000000 7.000000 97.000000 59.000000 15.000000 7.000000 11.000000 85.000000 114.000000
max 2023.000000 211468.000000 211.000000 42.300000 212041.000000 211.000000 46.000000 1146.000000 113.000000 491.000000 ... 24.000000 30.000000 103.000000 489.000000 328.000000 91.000000 27.000000 38.000000 2101.000000 2159.000000

8 rows × 22 columns

In [10]:
# checking for all the null values in the dataset
tennis_data.isnull()
Out[10]:
Year Tournament ID Tournament Name Type of Surface Level of Tournament Winner ID Name of Winner Winning Player's Hand Winning Player's Height (in cm) Winning Player's Country ... Winner Break Points Saved Winner Break Points Faced Loser Aces Loser Service Points Loser First Serves In Loser Service Games Played Loser Break Points Faced Loser Break Points Saved Winning Player's Rank Losing Player's Rank
0 False False False False False False False False False False ... False False False False False False False False False False
1 False False False False False False False False False False ... False False False False False False False False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False False False False False False False ... False False False False False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67703 False False False False False False False False False False ... False False False False False False False False False False
67704 False False False False False False False False False False ... True True True True True True True True False False
67705 False False False False False False False False False False ... False False False False False False False False False False
67706 False False False False False False False False False False ... False False False False False False False False False False
67707 False False False False False False False False False False ... True True True True True True True True False False

67708 rows × 32 columns

Distribution of NaN values in the dataset¶

  • The following Python code uses the matplotlib and seaborn libraries to create a heatmap that visualizes the distribution of NaN (missing) values in the tennis_data DataFrame.

  • The Heatmap is a data technique, that can be used to show the distribution of NaN values across variables and observations.

In [11]:
plt.figure(figsize=(10, 12))
sns.heatmap(tennis_data.isna(), cbar=False)
plt.title('Heatmap of NaN Values in the Dataset')
plt.show()

From the heatmap we can reach the following observations:¶

1). NaN values are spread out evenly.
2). The columns on the right side have much more NaN values.
3). The Game duration variable has a large gap of NaN values.

Method 1: Filtering the Data¶

Using the .dropna() method to compare the descriptive statistics with the original data to check for disparities and the effect on the analysis.

In [12]:
# will drop all the rows with null values, axis = 0
drop_tennis_data_check = tennis_data.dropna()

# will describe the descriptive statistics of the tennis data after filtering the values.
drop_tennis_data_check.describe()
Out[12]:
Year Winner ID Winning Player's Height (in cm) Winner's Age Loser ID Losing Player's Height (in cm) Loser's Age Game Duration (minutes) Winner Aces Winner Service Points ... Winner Break Points Saved Winner Break Points Faced Loser Aces Loser Service Points Loser First Serves In Loser Service Games Played Loser Break Points Faced Loser Break Points Saved Winning Player's Rank Losing Player's Rank
count 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 ... 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000 58946.000000
mean 2011.403725 108704.925780 186.249347 26.449637 108421.738456 185.673328 26.646149 107.206630 6.908136 77.949988 ... 3.455502 5.024226 5.115445 80.989957 48.639874 12.303973 4.765650 8.595766 59.395413 89.983867
std 6.564967 17800.676578 6.816407 3.934795 17058.869881 6.774238 3.994296 41.175354 5.530789 29.140404 ... 3.071682 4.022039 4.895310 29.156173 19.216700 4.222530 3.262421 4.133036 73.487907 110.709605
min 2001.000000 100644.000000 168.000000 15.800000 100644.000000 163.000000 15.400000 3.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 2006.000000 103607.000000 183.000000 23.600000 103507.000000 183.000000 23.700000 77.000000 3.000000 57.000000 ... 1.000000 2.000000 2.000000 60.000000 35.000000 9.000000 2.000000 6.000000 17.000000 35.000000
50% 2011.000000 104460.000000 185.000000 26.200000 104372.000000 185.000000 26.500000 100.000000 6.000000 73.000000 ... 3.000000 4.000000 4.000000 76.000000 45.000000 11.000000 4.000000 8.000000 42.000000 65.000000
75% 2017.000000 105238.000000 190.000000 29.100000 105376.000000 190.000000 29.400000 130.000000 9.000000 94.000000 ... 5.000000 7.000000 7.000000 97.000000 59.000000 15.000000 7.000000 11.000000 78.000000 104.000000
max 2023.000000 209950.000000 211.000000 42.300000 211663.000000 211.000000 44.000000 1146.000000 113.000000 491.000000 ... 24.000000 30.000000 103.000000 489.000000 328.000000 91.000000 27.000000 38.000000 1890.000000 2146.000000

8 rows × 22 columns

Method 2: Imputing the Data¶

Using the .fillna() method to compare the descriptive statistics with the original data to check for disparities and the effect on the analysis.

In [13]:
# will fill all the NaN values with 0, axis = 0
clean_tennis_data = tennis_data.fillna(0)

# will describe the descriptive statistics of the tennis data after filtering the values
clean_tennis_data.describe()
Out[13]:
Year Winner ID Winning Player's Height (in cm) Winner's Age Loser ID Losing Player's Height (in cm) Loser's Age Game Duration (minutes) Winner Aces Winner Service Points ... Winner Break Points Saved Winner Break Points Faced Loser Aces Loser Service Points Loser First Serves In Loser Service Games Played Loser Break Points Faced Loser Break Points Saved Winning Player's Rank Losing Player's Rank
count 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 ... 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.000000 67708.00000 67708.000000 67708.000000
mean 2011.380250 108946.000177 182.025462 26.345362 109010.336873 177.241065 26.447461 95.102794 6.313449 71.077849 ... 3.147265 4.575855 4.662580 73.877695 44.377297 11.223947 4.346768 7.84634 78.764326 115.634091
std 6.541564 18394.042835 28.380673 3.976588 18438.182738 39.210845 4.096072 51.408822 5.654492 35.589375 ... 3.090682 4.098286 4.897643 36.134858 22.990836 5.339438 3.398067 4.64402 138.214358 185.437718
min 2001.000000 100644.000000 0.000000 0.000000 100644.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
25% 2006.000000 103656.000000 183.000000 23.400000 103580.000000 180.000000 23.500000 69.000000 2.000000 52.000000 ... 1.000000 1.000000 1.000000 55.000000 32.000000 9.000000 2.000000 5.00000 17.000000 34.000000
50% 2011.000000 104468.000000 185.000000 26.100000 104468.000000 185.000000 26.300000 93.000000 5.000000 69.000000 ... 2.000000 4.000000 3.000000 72.000000 43.000000 11.000000 4.000000 8.00000 44.000000 67.000000
75% 2017.000000 105332.000000 190.000000 29.000000 105453.000000 190.000000 29.300000 125.000000 9.000000 92.000000 ... 5.000000 7.000000 7.000000 95.000000 57.000000 14.000000 6.000000 11.00000 85.000000 113.000000
max 2023.000000 211468.000000 211.000000 42.300000 212041.000000 211.000000 46.000000 1146.000000 113.000000 491.000000 ... 24.000000 30.000000 103.000000 489.000000 328.000000 91.000000 27.000000 38.00000 2101.000000 2159.000000

8 rows × 22 columns

Imputation Wins!!¶

Upon experimenting with both methods and also based on the heatmap visualization, it is evident that imputing values resulted in nearly identical descriptive statistics when compared to the original dataset. This suggests that imputation is a preferable option for preparing the data, ensuring the most accurate analysis possible.

In [14]:
print(clean_tennis_data.columns.tolist())
['Year', 'Tournament ID', 'Tournament Name', 'Type of Surface', 'Level of Tournament', 'Winner ID', 'Name of Winner', "Winning Player's Hand", "Winning Player's Height (in cm)", "Winning Player's Country", "Winner's Age", 'Loser ID', 'Name of Losing Player', "Losing Player's Hand", "Losing Player's Height (in cm)", "Losing Player's Country", "Loser's Age", 'Game Duration (minutes)', 'Winner Aces', 'Winner Service Points', 'Winner First Serves In', 'Winner Service Games Played', 'Winner Break Points Saved', 'Winner Break Points Faced', 'Loser Aces', 'Loser Service Points', 'Loser First Serves In', 'Loser Service Games Played', 'Loser Break Points Faced', 'Loser Break Points Saved', "Winning Player's Rank", "Losing Player's Rank"]

Data Analysis of ATP Stats from 2001 - 2023¶

This part of the project involves analysing the dataset where we answer some questions.

Let us answer our goals:

Goal 1¶

Do the Top Player perform better on any specific type of playing surface? And for games played between the top right-handed player and left handed players, what is the percentage of victories for each handness on different court type?¶

Needed Functions:¶
  • value_counts(): Returns an object containing counts of unique values. This means it enables us to count the number of unique elements in a column of a Pandas dataframe.
  • sort_values(): The default value of this function is to sort in ascending order. However, you can pass an argument to the function that disables the default behavior, which will return the values in descending order.
  • isin(): Checks if the Dataframe contains the specified value(s). It returns a DataFrame similar to the original DataFrame, but the original values have been replaced with True if the value was one of the specified values, otherwise False .
  • groupby():Grouping the data points based on the distinct values in the given column or columns. We can then calculate aggregated values for the generated groups.
  • merge: Updates the content of two DataFrame by merging them together

In order to do the analysis for this question, We will need to first find out what is our top player. Our team decided to consider the top five players that has the most win in the dataset as the top players that we do the analysis on. In order to get the number of the wins that the top player get on each type of surface, we will need to do some data filtering, select the needed column and count the filtered data.

  • Counting the filtered data: In order for us to have the number of win, we will need to use the function .value_counts() to count the filtered data.
  • Show the top five player: In order for us to show the top five player with the most number of win, we will need to use the function .sort_values(ascending = False) to make the list in a descending order.
In [15]:
sorted_win_count = clean_tennis_data['Name of Winner'].value_counts().sort_values(ascending=False)
sorted_win_count.head(10)

# Add columns specifying the type of surface and hand - either left or right.
Out[15]:
Roger Federer      1214
Novak Djokovic     1083
Rafael Nadal       1079
David Ferrer        740
Andy Murray         738
Tomas Berdych       643
Andy Roddick        608
Richard Gasquet     605
Marin Cilic         586
Stan Wawrinka       566
Name: Name of Winner, dtype: int64
In [16]:
# Filtering data for top players
top_players = clean_tennis_data['Name of Winner'].value_counts().head(5).index.tolist()
top_players_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(top_players)]

# Analyzing performance on different surfaces
winner_surface_performance = top_players_data.groupby(["Winner ID",'Name of Winner', 'Type of Surface',"Winning Player's Hand"]).agg(
    Total_Wins=pd.NamedAgg(column='Name of Winner', aggfunc='count'),
)

# Resetting index for better readability
winner_surface_performance.reset_index(inplace=True)

top_winners_data = pd.DataFrame(winner_surface_performance)

After some data processing, we get the top five player's numbers of win on each type of surface. We can see that all the player win the most game on Hard surface.We think that the result is this way because all the matches are play mostly on hard surface, which doesn't represent that they perform better on hard surface. So we should conduct further analysis like finding the their number of losses are each type of surfaces and their winrate. To achieve this, we will have to calculate the number of times they appear on the Name of losing player and calculate their winrate.

In [17]:
loser_surface_performance = top_players_data.groupby(["Name of Losing Player", "Type of Surface","Losing Player's Hand"]).agg(
    Total_Lost=pd.NamedAgg(column="Name of Losing Player", aggfunc='count'),
)
loser_surface_performance.reset_index(inplace=True)
lost_count = pd.DataFrame(loser_surface_performance)
lost_count
# Merging the winner and loser dataframe by names and type of surfaces
win_percentage = winner_surface_performance.merge(lost_count, left_on=['Name of Winner', 'Type of Surface'], 
                             right_on=['Name of Losing Player', 'Type of Surface'])
# Drop the duplicate columns
win_percentage.drop(columns=["Name of Losing Player","Losing Player's Hand"], inplace = True)

# Find the total number of game played on each surface
win_percentage['Total_Game_Played'] = win_percentage['Total_Wins'] + win_percentage['Total_Lost']

# Filtering data for top players
top_players = clean_tennis_data['Name of Winner'].value_counts().head(5).index.tolist()
top_players_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(top_players)]

# Analyzing performance on different surfaces
winner_surface_performance = top_players_data.groupby(["Winner ID",'Name of Winner', 'Type of Surface',"Winning Player's Hand"]).agg(
    Total_Wins=pd.NamedAgg(column='Name of Winner', aggfunc='count'),
)

# Resetting index for better readability
winner_surface_performance.reset_index(inplace=True)

# Filtering data for top players
top_players = clean_tennis_data['Name of Winner'].value_counts().head(5).index.tolist()
top_players_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(top_players)]

# Analyzing performance on different surfaces
winner_surface_performance = top_players_data.groupby(["Winner ID",'Name of Winner', 'Type of Surface',"Winning Player's Hand"]).agg(
    Total_Wins=pd.NamedAgg(column='Name of Winner', aggfunc='count'),
)

# Resetting index for better readability
winner_surface_performance.reset_index(inplace=True)

top_winners_data = pd.DataFrame(winner_surface_performance)
top_winners_data


# Calculate 'Win_Percentage' column by dividing 'Total_Wins' by 'Total_Game_Played' and multiplying by 100
win_percentage['Win_Percentage'] = win_percentage['Total_Wins']/win_percentage['Total_Game_Played']*100
win_percentage
Out[17]:
Winner ID Name of Winner Type of Surface Winning Player's Hand Total_Wins Total_Lost Total_Game_Played Win_Percentage
0 103819 Roger Federer Clay R 224 18 242 92.561983
1 103819 Roger Federer Grass R 192 5 197 97.461929
2 103819 Roger Federer Hard R 761 40 801 95.006242
3 103970 David Ferrer Carpet R 9 1 10 90.000000
4 103970 David Ferrer Clay R 337 29 366 92.076503
5 103970 David Ferrer Grass R 44 1 45 97.777778
6 103970 David Ferrer Hard R 350 43 393 89.058524
7 104745 Rafael Nadal Clay L 478 14 492 97.154472
8 104745 Rafael Nadal Grass L 76 5 81 93.827160
9 104745 Rafael Nadal Hard L 523 42 565 92.566372
10 104918 Andy Murray Clay R 110 17 127 86.614173
11 104918 Andy Murray Grass R 119 5 124 95.967742
12 104918 Andy Murray Hard R 501 41 542 92.435424
13 104925 Novak Djokovic Clay R 275 28 303 90.759076
14 104925 Novak Djokovic Grass R 115 5 120 95.833333
15 104925 Novak Djokovic Hard R 684 35 719 95.132128
In [18]:
# Filtering matches where one player is right-handed and the other is left-handed
matches_rh_vs_lh_specific = tennis_data[((tennis_data['Winning Player\'s Hand'] == 'R') & (tennis_data['Losing Player\'s Hand'] == 'L')) |
                                        ((tennis_data['Winning Player\'s Hand'] == 'L') & (tennis_data['Losing Player\'s Hand'] == 'R'))]

# Group by the handedness of the winner and the type of surface, then count the wins
victories_by_hand_surface_specific = matches_rh_vs_lh_specific.groupby(['Winning Player\'s Hand', 'Type of Surface']).size().reset_index(name='Wins')

# Counting the total number of such matches played on each surface
total_matches_by_surface_specific = matches_rh_vs_lh_specific.groupby('Type of Surface').size().reset_index(name='Total Matches')

# Merging the wins with the total matches to calculate win percentages
victory_percentage_specific = pd.merge(victories_by_hand_surface_specific, total_matches_by_surface_specific, on='Type of Surface')
victory_percentage_specific['Win Percentage'] = (victory_percentage_specific['Wins'] / victory_percentage_specific['Total Matches']) * 100

# Display the DataFrame
victory_percentage_specific[['Type of Surface', 'Winning Player\'s Hand', 'Win Percentage']]
Out[18]:
Type of Surface Winning Player's Hand Win Percentage
0 Carpet L 44.186047
1 Carpet R 55.813953
2 Clay L 50.030272
3 Clay R 49.969728
4 Grass L 46.646341
5 Grass R 53.353659
6 Hard L 46.372599
7 Hard R 53.627401
In [19]:
#insert the dataframe into the graph
fig = px.bar(winner_surface_performance, 
             x='Name of Winner', 
             y='Total_Wins', 
             # Create the group bar chart
             color='Type of Surface',
             barmode='group',
             #modify the title for the interaction of the bar chart
             labels={'Name of Winner': 'Player Name', 'Total_Wins': 'Total Wins'},
             title='Performance of Top Tennis Players on Different Surfaces')
# Show the plot
fig.show()

Observation Based On the Visualization graph and Dataframe¶

Who is the dominating player and the worst player on Clay?¶

Based on the visualization graph and dataframe, we can see that Rafael Nadal dominate on the Clay surface by having a winrate of 97% with a total game of 492 which shows a overwhelming superiority over other player on Clay surface. Andy Murray has the least win rate with a total game of 127 on Clay which might indictae that he doesn't perform that well on the Clay surface.

Who is the dominating player and the worst player on Grass?¶

Based on the visualization graph and dataframe, we can see that David Ferrer and Roger Federer has the most win rate on Grass. David Ferrer has a win rate of 97.78% with a total game of 45 and Roger Federer has a win rate of 97.49% with a total game of 199 which make them both the most dominating player on Grass surface. But considering Roger Federer has a much larger sample size, so the win rate will be more accurate and has more stability.So we indicate that he is the dominating player on Grass Surface. Rafael Nadal has the lowest winrate of 93.83% with a total game of 81 on Grass Surface, eventhough it is still a very solid win rate, but it place him the lowest rating while playing on Grass surface comparing to the other members in the top 5 list.

Who is the dominating player and the worst player on Hard?¶

Based on the visualization graph and dataframe, we can see that Novak Djokovic and Roger Federer has the most win rate on Hard surface. Novak Djokovic has a win rate of 95.13% with a total game of 719 and Roger Federer has a win rate of 95.15% with a total game of 825 which make them both the most dominating player on Hard surface. David Ferrer has the lowest win rate which is 89.05% out of a total game of 393 which indicates that he doesn't perform very well on the hard surface.

What is the best and worst performing surface for Andy Murray?¶

Based on the visualization graph, Andy Murray perform much better on grass with a win rate of 95.97% which might be an indication that he will play better on grass surface. His worst performing surface is Clay where he only has 86.61% which indicate that he is not as good when he play on Clay.

What is the best and worst performing surface for David Ferrer?¶

Based on the visualization graph, David Ferrer perform much better on grass where he has a win rate of 97.78% which indicate that he will be performing much better on grass surface. His is not as good when he plays on hard surface with a winrate of 89.06% which serve as an evidence to the claim.

What is the best and worst performing surface for Novak Djokovic?¶

Based on the visualization graph, Novak Djokovic perform much better on grass with a win rate of 95.83% which indicate he will be in his best performance when playing on grass. On the other hand, he has a win rate of 90.76% on clay which shows that he is not as good when he play on clay.

What is the best and worst performing surface for Rafael Nadal?¶

Based on the visualization graph, Rafael Nadal perform much better on clay where he has a win rate of 97.15%. It indicate that he will be a much better self when playing on Clay. On Hard surface, he has the lowest win rate which is 92.57% which shows that he is not as good when he play on Hard surface.

What is the best and worst performing surface for Roger Federer?¶

Based on the visualization graph, Roger Federer perform much better on grass where he has a win rate of 97.49% which shows that he barely lose his game when he play on grass. On Clay surface, he has his lowest win rate which is 92.65% which shows that he is not as good when he play on Clay surface.

Inference¶

We see that players are not doing as well when they are playing on carpet surfaces this is also due to the fact that our data is old tennis is not being played on carpet in current time. However, we see that hard surfaces have the highest wins because most of the matches are being played on hard surfaces. This will help the sponsorship brand like Nike and Under Armour partner with a player and work on making equipment to help them improve their game. We would recommend Nike to work with Roger Federer and create a product like sneakers that will help play better on clay surfaces. We would recommend Under Armour to partner with David Ferrer to make a product that will improve him when playing on hard surfaces.

Goal 2¶

How significantly does a player's age impact their game performance, and can we associate the success of younger players and the difficulties faced by older players with factors like stamina and match duration?

In [20]:
#group the players by age, and then aggregate the average game duration and win counts, for use in visualization
group_by_age = clean_tennis_data.groupby("Winner's Age").agg(
    avg_dur=('Game Duration (minutes)', 'mean'),
    win_cnt=('Winner ID', 'count')).reset_index()

#create the scatter plot of age vs game duration
fig, ax1 = plt.subplots(figsize=(12, 6))
sns.scatterplot(x="Winner's Age", y='avg_dur', data=group_by_age, ax=ax1, color='blue', label='Average Game Duration')

#basic formatting of chart for title, axis labels, and upper/lower bounds of chart
ax1.set_xlabel('Age of Winner')
ax1.set_xlim(left=12, right=42)
ax1.set_ylabel('Avg Game Duration (minutes)', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')
ax1.set_title('Age vs Game Duration and Number of Wins')

#create a secondary axis for number of wins, add line plot, and labels
ax2 = ax1.twinx()
sns.lineplot(x="Winner's Age", y='win_cnt', data=group_by_age, ax=ax2, color='orange', label='Number of Wins')
ax2.set_ylabel('Number of Wins', color='orange')
ax2.tick_params(axis='y', labelcolor='orange')

#place legends in the upper corners for easier viewing 
ax1.legend(loc='upper left')
ax2.legend(loc='upper right')

plt.show()

Inference¶

This information will help sport entertainment broadcasters understand how long a game might take depending on the age of the players. This will make the casting and streaming of the game more precise and they will be able to allocate a specific amount of time for the game and how much ad should be streamed during the game and when. For example if the players in a competition are between the ages of 40 - 44 they know that the match will take 111.178 minutes which is about 2 hours specifically 1 hour and 52 minutes. ESPN and BIeN Sports will know how long it will take and can plan what they will be showing after the match and have the line up of what will be playing on that day. The longer the game is the less time they will have to show other broadcasts.

Goal 3¶

Clash of legends¶

During the clash of the Legends i.e Djokovic/ Nadal or Djokovic/Federer or Federer/Nadal, which player proved to be a winner majority of the time against their opponent and what was the reason behind it? Is the court/season, hand, age and height the reason for the winner’s success?

Analysis based on tournaments

In [21]:
def clash(players):
    clash_data = clean_tennis_data[clean_tennis_data['Name of Winner'].isin(players) & clean_tennis_data['Name of Losing Player'].isin(players)]
    winners = clash_data['Name of Winner'].value_counts()
    tournament_winners_count = clash_data[['Tournament Name','Name of Winner']].value_counts().reset_index(name='Count')
    return clash_data, winners,tournament_winners_count
In [22]:
players = ['Novak Djokovic', 'Rafael Nadal', 'Roger Federer']
clash_data, winners, tournament_winners_count = clash(players)
In [23]:
winners
Out[23]:
Novak Djokovic    58
Rafael Nadal      53
Roger Federer     40
Name: Name of Winner, dtype: int64
In [24]:
tournament_winners_count
Out[24]:
Tournament Name Name of Winner Count
0 Roland Garros Rafael Nadal 14
1 Rome Masters Rafael Nadal 8
2 Tour Finals Novak Djokovic 7
3 Indian Wells Masters Novak Djokovic 6
4 Rome Masters Novak Djokovic 6
5 Australian Open Novak Djokovic 6
6 Wimbledon Novak Djokovic 5
7 Tour Finals Roger Federer 5
8 Monte Carlo Masters Rafael Nadal 5
9 Madrid Masters Rafael Nadal 4
10 US Open Novak Djokovic 4
11 Miami Masters Novak Djokovic 4
12 Wimbledon Roger Federer 4
13 Cincinnati Masters Roger Federer 3
14 Shanghai Masters Roger Federer 3
15 Paris Masters Novak Djokovic 3
16 Roland Garros Novak Djokovic 3
17 Cincinnati Masters Novak Djokovic 3
18 Indian Wells Masters Roger Federer 3
19 Monte Carlo Masters Roger Federer 3
20 Dubai Roger Federer 3
21 US Open Roger Federer 3
22 Australian Open Rafael Nadal 3
23 Australian Open Roger Federer 2
24 Tour Finals Rafael Nadal 2
25 Miami Masters Roger Federer 2
26 US Open Rafael Nadal 2
27 Beijing Novak Djokovic 2
28 Canada Masters Novak Djokovic 2
29 Wimbledon Rafael Nadal 2
30 Basel Roger Federer 2
31 Monte Carlo Masters Novak Djokovic 2
32 Miami Masters Rafael Nadal 2
33 Indian Wells Masters Rafael Nadal 2
34 Masters Cup Roger Federer 2
35 Hamburg Masters Rafael Nadal 2
36 Davis Cup WG R1: ESP vs SRB Rafael Nadal 1
37 Doha Novak Djokovic 1
38 Dubai Novak Djokovic 1
39 Dubai Rafael Nadal 1
40 Basel Novak Djokovic 1
41 Hamburg Masters Roger Federer 1
42 Cincinnati Masters Rafael Nadal 1
43 Madrid Masters Novak Djokovic 1
44 Davis Cup WG PO: SUI vs SCG Roger Federer 1
45 Beijing Olympics Rafael Nadal 1
46 Roland Garros Roger Federer 1
47 Madrid Masters Roger Federer 1
48 Masters Cup Rafael Nadal 1
49 Queen's Club Rafael Nadal 1
50 Canada Masters Roger Federer 1
51 Canada Masters Rafael Nadal 1
52 Atp Cup Novak Djokovic 1
In [25]:
custom_colors = {'Category1': 'blue', 'Category2': 'yellow', 'Category3': 'black'}

fig = px.bar(tournament_winners_count.reset_index(), x='Tournament Name', y='Count',
            labels={'Count': 'Number of Winners'},
            color='Name of Winner',
            title='Number of Winners per Tournament',
            hover_name='Count',
            color_discrete_map = custom_colors)

fig.update_layout(
        xaxis=dict(tickangle=-90, showgrid=False),
        yaxis=dict(showgrid=False),
        height=500,  
        width=800
    )
fig.show()

Analysis Based on Surface

In [26]:
def get_player_name(player_id, data):
    if player_id in data['Winner ID'].values:
        return data[data['Winner ID'] == player_id]['Name of Winner'].iloc[0]
    elif player_id in data['Loser ID'].values:
        return data[data['Loser ID'] == player_id]['Name of Losing Player'].iloc[0]
    else:
        return "Unknown Player"
In [27]:
def create_player_profile(player_id, data, surface=None):
    player_name = get_player_name(player_id, data)
    if surface and surface.strip():
        player_matches = data[((data['Winner ID'] == player_id) | (data['Loser ID'] == player_id)) & (data['Type of Surface'] == surface)]
    else:
        player_matches = data[(data['Winner ID'] == player_id) | (data['Loser ID'] == player_id)]
    
    total_matches = len(player_matches)
    total_wins = len(player_matches[player_matches['Winner ID'] == player_id])
    average_aces = (player_matches[(player_matches['Winner ID'] == player_id)]['Winner Aces'].mean() + 
                    player_matches[(player_matches['Loser ID'] == player_id)]['Loser Aces'].mean())
    total_bp_saved = player_matches[(player_matches['Winner ID'] == player_id)]['Winner Break Points Saved'].mean() + player_matches[(player_matches['Loser ID'] == player_id)]['Loser Break Points Saved'].mean()
    total_bp_faced = player_matches[(player_matches['Winner ID'] == player_id)]['Winner Break Points Faced'].mean() + player_matches[(player_matches['Loser ID'] == player_id)]['Loser Break Points Faced'].mean()
    
    profile = {
        'Player ID': player_id,
        'Player Name': player_name,
        'Total Matches': total_matches,
        'Total Wins': total_wins,
        'Average Aces per Match': average_aces,
        'BP Saved': total_bp_saved,
        'BP Faced': total_bp_faced
    }

    return profile
In [28]:
def compare_players(player_ids, data, surface=None):
    profiles = [create_player_profile(player_id, data, surface) for player_id in player_ids]

    # Prepare data for the bar chart
    categories = ['Total Matches', 'Total Wins', 'Average Aces per Match', 'BP Saved', 'BP Faced']
    traces = []
    colors = ['violet', 'orange', 'blue']

    for i, player_id in enumerate(player_ids):
        player_values = [profiles[i][cat] for cat in categories]
        trace = go.Bar(x=categories, y=player_values, name=f'{profiles[i]["Player Name"]} (ID: {player_id})', marker_color=colors[i])
        traces.append(trace)

    # Create the figure
    fig = go.Figure(data=traces)
    fig.update_layout(title_text='Head-to-Head comparison of players', barmode='group')

    return fig
In [29]:
def filter_head_to_head_matches(player_ids, data, surface=None):
    if surface:
        surface = surface.lower()
        head_to_head = data[
            (((data['Winner ID'] == player_ids[0]) & (data['Loser ID'].isin(player_ids[1:])) |
              (data['Winner ID'] == player_ids[1]) & (data['Loser ID'].isin([player_ids[0], player_ids[2]])) |
              (data['Winner ID'] == player_ids[2]) & (data['Loser ID'].isin([player_ids[0], player_ids[1]]))) &
             (data['Type of Surface'].str.lower() == surface))
        ]
    else:
        head_to_head = data[
            (((data['Winner ID'] == player_ids[0]) & (data['Loser ID'].isin(player_ids[1:])) |
              (data['Winner ID'] == player_ids[1]) & (data['Loser ID'].isin([player_ids[0], player_ids[2]])) |
              (data['Winner ID'] == player_ids[2]) & (data['Loser ID'].isin([player_ids[0], player_ids[1]]))))
        ]

    return head_to_head
In [30]:
def create_player_head_to_head_profile(player_id, head_to_head_matches):
    total_matches = len(head_to_head_matches)
    total_wins = len(head_to_head_matches[head_to_head_matches['Winner ID'] == player_id])
    average_aces = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Aces'].mean()

    win_bp_saved = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Break Points Saved'].sum()
    win_bp_faced = head_to_head_matches[head_to_head_matches['Winner ID'] == player_id]['Winner Break Points Faced'].sum()
    loser_bp_saved = head_to_head_matches[head_to_head_matches['Loser ID'] == player_id]['Loser Break Points Saved'].sum()
    loser_bp_faced = head_to_head_matches[head_to_head_matches['Loser ID'] == player_id]['Loser Break Points Faced'].sum()

    profile = {
        'Player ID': player_id,
        'Total Matches': total_matches,
        'Total Wins': total_wins,
        'Average Aces per Match': average_aces,
        'Winner BP Saved': win_bp_saved,
        'Winner BP Faced': win_bp_faced,
        'Loser BP Saved': loser_bp_saved,
        'Loser BP Faced': loser_bp_faced,
    }

    return profile

# Example usage with user input for three players
surface = input("Enter the surface type (Clay, Grass, Hard) or leave blank for all surfaces: ").strip()
player1_id = 103819  # Replace with the actual ID of the first player
player2_id = 104745  # Replace with the actual ID of the second player
player3_id = 104925  # Replace with the actual ID of the third player
player_ids = [player1_id, player2_id, player3_id]

# Filter for head-to-head matches on the specified surface
head_to_head_matches = filter_head_to_head_matches(player_ids, clean_tennis_data, surface)

# Create profiles based on head-to-head matches
player1_profile = create_player_head_to_head_profile(player1_id, head_to_head_matches)
player2_profile = create_player_head_to_head_profile(player2_id, head_to_head_matches)
player3_profile = create_player_head_to_head_profile(player3_id, head_to_head_matches)

# Compare and visualize the profiles
fig = compare_players(player_ids, head_to_head_matches, surface)
fig.show()

Inference¶

Based on the graph, we can infer that companies should consider players as the face of their brand based on the number of wins a player had in that geographic location where the company has a majority audience as compared to a player who has won multiple matches overall.

For client Lacoste: Lacoste would benefit from choosing Nadal as their brand ambassador for France, even though he has lost more matches than other athletes. Nadal's impressive record at French tournaments like Roland Garros and Monte Carlo Masters makes him a strong choice for reaching the French audience.

For client Sergio Tacchini: Sergio Tacchini would benefit from choosing Nadal as their brand ambassador for Italy, even though he has lost more matches than other athletes. Nadal's impressive record at Italian tournaments like Rome Masters held in Rome. Also in Tour Finals held in Turin, Nadal is the only player to have won.

For clients Under Armour and Nike: Under Armour and Nike would benefit from choosing Novak Djokovic as their brand ambassador for USA. Djokovic's impressive record at American tournaments like Indian Wells Masters, Miami Masters, Cincinnati Masters makes him a perfect candidate

This suggests that companies prioritize local relevance and cultural connection over global dominance when choosing brand ambassadors. From the above graph we can interpret that there is no one major winner that has majorly won. Based on various tournaments and locations there are different winners so assigning one winner to every company is not correct. Hence every company based on their geographic location would be better if they keep their geographic winner as their brand ambassador which would attract more crowds as compared to a player who has won more but has not been successful geographically.

Goal 4¶

Does the number of aces (ace means when a serve is in but not hit at all by the receiving player) served by a player in a game directly depend on the outcome? (more aces = winner)? Why is that so?¶

In [31]:
# Add a column to the data to better visualize the data
clean_tennis_data['Point Color'] = clean_tennis_data.apply(lambda row: 'blue' if row['Winner Aces'] > row['Loser Aces'] else 'orange', axis=1)

# Create a jointplot with scatter plot and area chart
jointplot = sns.jointplot(data=clean_tennis_data, x='Winner Aces', y='Loser Aces', hue='Point Color', 
                          palette={'blue': 'blue', 'orange': 'orange'}, kind="scatter", 
                          xlim=(0, 60), ylim=(0, 60), height=8, ratio=4, marginal_ticks=True, legend=False)

# Add a trend line to the scatter plot 
sns.regplot(data=clean_tennis_data, x='Winner Aces', y='Loser Aces', scatter=False, ax=jointplot.ax_joint, color='#000000')

# Format the chart 
plt.suptitle('Winner vs Loser Aces in Tennis Matches', y=1.02)
plt.xlabel('Winner Aces')
plt.ylabel('Loser Aces')

plt.show()

Inference¶

The main motive of this analysis is to determine a relationship between the number of aces delivered during a match by a particular player and whether that accounts for the player's victory. This relationship is showcased using the joint plot function of the seaborn library with a scatter plot and marginal histograms. The points on the plot are color-coded based on the number of aces scored by the winner and the loser. Blue points are used to indicate the winners with more aces while orange points are used to indicate the opposite scenario.

Out of the two histograms, the Top Histogram is used to showcase the winner aces while the Right Histogram depicts the loser aces. Through the trend line, it can be observed that there is a positive relationship between the winner and the loser aces. Through the plot, the majority of the blue points suggest that there is a consistent trend where the winner of the game has played more aces than the loser. On the other side, the orange points depict that the number of matches in which the loser scored more aces is significantly in the minority. Hence, to sum it up we can say that aces play a crucial role in tennis matches and can help in strategic dominance over the opponents. Through the graph, it can be stated that the winner tends to have more aces than the loser in the matches. Hence for most cases, More Aces = Winner

This information holds value for the betting company because it will let people choose the player to bet on depending on the number of aces being scored in the match. Betting companies nowadays allow real-time options as well to place bets, the bettors can keep this factor in mind when placing a bet. When Draftkings Sportsbook shows the data provided by us then bettors are able to see that in the past the winner of a specific game usually has the most aces. This information will allow betters to place a bet in which they can guess what the amount of aces and who will win. Also, this information carries a lot of importance for our clients as the sports broadcasters can use this analysis for commentary purposes and compare the tendency of playing aces for particular players.

Goal 5 - Tennis Match Winner Prediction Using Machine Learning¶

The cells below use machine learning to predict winner between 2 players from the clean tennis dataset.

  • The process of predicting the winner has the following steps:

1). Importing necessary libraries¶

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

2). Data Processing¶

  • Winners and losing player are combined together to form a list of unique players in the dataset.
  • We use LabelEncoder to convert player names to numerical codes, making them easier and suitable for machine learning.
  • The surface column is one-hot encoded to handle categorical data, representing the match surface type.
In [33]:
# Make a column of unique player names
all_player_names = pd.concat([clean_tennis_data['Name of Winner'], clean_tennis_data['Name of Losing Player']]).unique()
label_encoder = LabelEncoder()
label_encoder.fit(all_player_names)

clean_tennis_data['winner_encoded'] = label_encoder.transform(clean_tennis_data['Name of Winner'])
clean_tennis_data['loser_encoded'] = label_encoder.transform(clean_tennis_data['Name of Losing Player'])

# One-hot encode the surface type
surface_encoded = pd.get_dummies(clean_tennis_data['Type of Surface'], prefix='Surface')
clean_tennis_data = pd.concat([clean_tennis_data, surface_encoded], axis=1)

# Add Aces and Break Points Saved features
clean_tennis_data['total_aces'] = clean_tennis_data['Winner Aces'] + clean_tennis_data['Loser Aces']
clean_tennis_data['total_bp_saved'] = clean_tennis_data['Winner Break Points Saved'] + clean_tennis_data['Loser Break Points Saved']

3). Feature Engineering¶

  • We use different features to train a machine learning model called RandomForestClassifier like: Name of Winning Player, Name of Losing Player, Number of Break Points saved by winners and losers and Number of Aces for winners and losers
In [34]:
# Prepare features and target
features = clean_tennis_data[['winner_encoded', 'loser_encoded', 'total_aces', 'total_bp_saved'] + list(surface_encoded.columns)]

# Create binary target variable
clean_tennis_data['target'] = clean_tennis_data.apply(lambda x: 1 if x['winner_encoded'] < x['loser_encoded'] else 0, axis=1)

4). Splitting the data for Training and Testing the model¶

  • The dataset is divided into training and testing sets for model validation.
In [35]:
# Splitting the data for training and testing.
X_train, X_test, y_train, y_test = train_test_split(features, clean_tennis_data['target'], test_size=0.2, random_state=42)

5). Training the model¶

  • The RandomForestClassifier is used as it is a popular classification model.
In [36]:
# Training the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
Out[36]:
RandomForestClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(random_state=42)

6). Evaluating the accuracy of the model and making the prediction¶

  • The model's accuracy is assessed using the test set.
  • We make a user defined function that predicts the winner based on the different parametres like the data for 2 players, type of surface and features.
In [37]:
# Evaluating the model accuracy
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
def predict_match_outcome(player1, player2, surface, model, label_encoder, df, feature_columns):
    player1_encoded = label_encoder.transform([player1])[0]
    player2_encoded = label_encoder.transform([player2])[0]
    surface_vector = [1 if surface == s else 0 for s in feature_columns if s.startswith('Surface_')]

    # Determine the number of additional features (excluding player and surface features)
    num_additional_features = len(feature_columns) - 2 - len(surface_vector)

    # Construct the feature vector
    additional_features = [0] * num_additional_features  # Placeholder for additional features
    features = [player1_encoded, player2_encoded] + surface_vector + additional_features

    # Make sure the number of features matches what the model was trained on
    assert len(features) == len(feature_columns), "Feature vector does not match the expected length"
    
     # Make a prediction
    prediction = model.predict([features])[0]

    # Interpret the result
    return player1 if prediction == 1 else player2
Accuracy: 0.995938561512332

7). Testing the model using examples¶

  • We test the predict_match_outcome to predict the winner between 2 players. For example use, we've used Andy Murray along with Daniil Medvedev on grass surface.
In [38]:
feature_columns = X_train.columns  # Assuming X_train is the DataFrame of training features
predicted_winner = predict_match_outcome('Andy Murray', 'Daniil Medvedev', 'grass', model, label_encoder, df, feature_columns)
print("Predicted Winner:", predicted_winner)
Predicted Winner: Andy Murray
C:\Users\ASUS\anaconda3\lib\site-packages\sklearn\base.py:420: UserWarning:

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

Inference¶

  • Based on the predictive model, We can find the potential winner for a future match between 2 players. Looking at the example between Andy Murray and Daniil Medvedev playing on grass, the model predicted Murray winning.

Conclusions¶

Summarize your findings and explain why the results are meaningful.

Our main objective as an analytics company is to empower our clients by delivering actionable insights, promoting wise decision-making, and eventually assisting in their businesses' greater profitability . For DraftKing Sportsbook: Based on the analysis we have can suggest which players should Draftking suggest its clients to bet on and as a result the client’s profit will help the growth of the company.

For Nike, Under Armour, Lacoste, Sergio Tacchini : Our Analysis will help increase the sales and reach of the brands in the regions and hence as a result it will foster business development and growth of our client.

For Broadcasting companies: ESPN and Bein sports require their content to reach to a large number of viewers and as a result our insights will help them reach wider audience.

"We pledge on our honor that we have not given nor received any unauthorized assistance on this assignment."¶

- Team SmithServe¶

In [ ]: